KnowItNow: Fast, Scalable Information Extraction from the Web
Authors: Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni
Abstract
Numerous NLP applications rely on search-engine queries, both to extract information from and to compute statistics over the Web corpus. But search engines often limit the number of available queries. As a result, query-intensive NLP applications such as Information Extraction (IE) distribute their query load over several days, making IE a slow, offline process. This paper introduces a novel architecture for IE that obviates queries to commercial search engines. The architecture is embodied in a system called KNOWITNOW that performs high-precision IE in minutes instead of days. We compare KNOWITNOW experimentally with the previously published KNOWITALL system, and quantify the tradeoff between recall and speed. KNOWITNOW's extraction rate is two to three orders of magnitude higher than KNOWITALL's.

1 Background and Motivation

Numerous modern NLP applications use the Web as their corpus and rely on queries to commercial search engines to support their computation (Turney, 2001; Etzioni et al., 2005; Brill et al., 2001). Search engines are extremely helpful for several linguistic tasks, such as computing usage statistics or finding a subset of web documents to analyze in depth; however, these engines were not designed as building blocks for NLP applications. As a result, the applications are forced to issue literally millions of queries to search engines, which limits the speed, scope, and scalability of the applications. Further, the applications must often then fetch some web documents, which at scale can be very time-consuming.

In response to heavy programmatic search engine use, Google has created the "Google API" to shunt programmatic queries away from Google.com and has placed hard quotas on the number of daily queries a program can issue to the API. Other search engines have also introduced mechanisms to limit programmatic queries, forcing applications to introduce "courtesy waits" between queries and to limit the number of queries they issue.

To understand these efficiency problems in more detail, consider the KNOWITALL information extraction system (Etzioni et al., 2005). KNOWITALL has a generate-and-test architecture that extracts information in two stages. First, KNOWITALL utilizes a small set of domain-independent extraction patterns to generate candidate facts (cf. (Hearst, 1992)). For example, the generic pattern "NP1 such as NPList2" indicates that the head of each simple noun phrase (NP) in NPList2 is a member of the class named in NP1. By instantiating the pattern for the class City, KNOWITALL extracts three candidate cities from the sentence: "We provide tours to cities such as Paris, London, and Berlin." Note that it must also fetch each document that contains a potential candidate. Next, extending the PMI-IR algorithm (Turney, 2001), KNOWITALL automatically tests the plausibility of the candidate facts it extracts using pointwise mutual information (PMI) statistics computed from search-engine hit counts. For example, to assess the likelihood that "Yakima" is a city, KNOWITALL will compute the PMI between Yakima and a set of k discriminator phrases that tend to have high mutual information with city names (e.g., the simple phrase "city"). Thus, KNOWITALL requires at least k search-engine queries for every candidate extraction it assesses. Due to KNOWITALL's dependence on search-engine queries, large-scale experiments utilizing KNOWITALL take days and even weeks to complete, which makes research using KNOWITALL slow and cumbersome.
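To make the generate-and-test loop concrete, the following is a minimal Python sketch of the two stages. The `hit_count` stub stands in for a metered commercial search-engine call, and the capitalized-word noun-phrase approximation replaces a real NP chunker; both are illustrative assumptions of this sketch, not KNOWITALL's actual implementation.

```python
import re

# Generate: instantiate the generic pattern "NP1 such as NPList2" for the
# class City. A real system runs a noun-phrase chunker; this sketch
# approximates simple NPs as capitalized words, enough to show the control flow.
CITY_PATTERN = re.compile(r"cities such as ((?:[A-Z][a-z]+(?:, and |, | and )?)+)")

def generate_candidates(sentence):
    """Return candidate city names matched by the instantiated pattern."""
    candidates = []
    for match in CITY_PATTERN.finditer(sentence):
        for np in re.split(r", and |, | and ", match.group(1)):
            if np.strip():
                candidates.append(np.strip())
    return candidates

# Test: assess each candidate via PMI over search-engine hit counts.
# hit_count() is a placeholder for a rate-limited commercial engine call --
# exactly the dependency that KNOWITNOW removes.
def hit_count(query):
    raise NotImplementedError("each call is one metered search-engine query")

def pmi(instance, discriminator):
    """PMI-IR style score: Hits(discriminator + instance) / Hits(instance),
    e.g. pmi("Yakima", "city of") for an illustrative discriminator."""
    both = hit_count(f'"{discriminator} {instance}"')
    alone = hit_count(f'"{instance}"')
    return both / alone if alone else 0.0

sentence = "We provide tours to cities such as Paris, London, and Berlin."
print(generate_candidates(sentence))  # ['Paris', 'London', 'Berlin']
# Scoring one candidate against k discriminators costs at least k queries,
# which is why large-scale KNOWITALL runs take days.
```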
Private access to Google-scale infrastructure would provide sufficient access to search queries, but at prohibitive cost, and the problem of fetching documents (even if from a cached copy) would remain (as we discuss in Section 2.1). Is there a feasible alternative Web-based IE system? If so, what size Web index and how many machines are required to achieve reasonable levels of precision/recall? What would the architecture of this IE system look like, and how fast would it run?

To address these questions, this paper introduces a novel architecture for web information extraction. It consists of two components that supplant the generate-and-test mechanisms in KNOWITALL. To generate extractions rapidly we utilize our own specialized search engine, called the Bindings Engine (or BE), which efficiently returns bindings in response to variabilized queries. For example, in response to the query "Cities such as ProperNoun(Head(〈NounPhrase〉))", BE will return a list of proper nouns likely to be city names. To assess these extractions, we use URNS, a combinatorial model, which estimates the probability that each extraction is correct without using any additional search engine queries.[1] For further efficiency, we introduce an approximation to URNS, based on the frequency of extractions' occurrence in the output of BE, and show that it achieves comparable precision/recall to URNS.

[1] In contrast, PMI-IR, which is built into KNOWITALL, requires multiple search engine queries to assess each potential extraction.

Our contributions are as follows:

1. We present a novel architecture for Information Extraction (IE), embodied in the KNOWITNOW system, which does not depend on Web search-engine queries.

2. We demonstrate experimentally that KNOWITNOW is the first system able to extract tens of thousands of facts from the Web in minutes instead of days.

3. We show that KNOWITNOW's extraction rate is two to three orders of magnitude greater than KNOWITALL's, but this increased efficiency comes at the cost of reduced recall. We quantify this tradeoff for KNOWITNOW's 60,000,000 page index and extrapolate how the tradeoff would change with larger indices.

Our recent work has described the BE search engine in detail (Cafarella and Etzioni, 2005), and also analyzed the URNS model's ability to compute accurate probability estimates for extractions (Downey et al., 2005). However, this is the first paper to investigate the composition of these components to create a fast IE system, and to compare it experimentally to KNOWITALL in terms of time, recall, precision, and extraction rate. The frequency-based approximation to URNS and the demonstration of its success are also new.

The remainder of the paper is organized as follows. Section 2 provides an overview of BE's design. Section 3 describes the URNS model and introduces an efficient approximation to URNS that achieves similar precision/recall. Section 4 presents experimental results. We conclude with related and future work in Sections 5 and 6.

2 The Bindings Engine

This section explains how relying on standard search engines leads to a bottleneck for NLP applications, and provides a brief overview of the Bindings Engine (BE), our solution to this problem. A comprehensive description of BE appears in (Cafarella and Etzioni, 2005).

Standard search engines are computationally expensive for IE and other NLP tasks. IE systems issue multiple queries, downloading all pages that potentially match an extraction rule, and performing expensive processing on each page.
For example, such systems operate roughly as follows on the query ("cities such as 〈NounPhrase〉"):

1. Perform a traditional search engine query to find all URLs containing the non-variable terms (e.g., "cities such as").

2. For each such URL: (a) obtain the document contents, (b) find the searched-for terms ("cities such as") in the document text, (c) run the noun phrase recognizer to determine whether the text following "cities such as" satisfies the linguistic type requirement, and (d) if so, return the string.

We can divide the algorithm into two stages: obtaining the list of URLs from a search engine, and then processing them to find the 〈NounPhrase〉 bindings. Each stage poses its own scalability and speed challenges. The first stage makes a query to a commercial search engine; while the number of available queries may be limited, a single one executes relatively quickly. The second stage fetches a large number of documents, each fetch likely resulting in a random disk seek; this stage executes slowly. Naturally, this disk access is slow regardless of whether it happens on a locally-cached copy or on a remote document server. The observation that the second stage is slow, even if it is executed locally, is important because it shows that merely operating a "private" search engine does not solve the problem (see Section 2.1).

The Bindings Engine supports queries containing typed variables (such as 〈NounPhrase〉) and string-processing functions (such as "head(X)" or "ProperNoun(X)") as well as standard query terms. BE processes a variable by returning every possible string in the corpus that has a matching type, and that can be substituted for the variable and still satisfy the user's query. If there are multiple variables in a query, then all of them must simultaneously have valid substitutions. (So, for example, the query "〈NounPhrase〉 is located in 〈NounPhrase〉" only returns strings when noun phrases are found on both sides of "is located in".) We call a string that meets these requirements a binding for the variable in question. These queries, and the bindings they elicit, can usefully serve as part of an information extraction system or other common NLP tasks (such as gathering usage statistics). Figure 1 illustrates some of the queries that BE can handle.

Figure 1: Examples of queries that can be handled by BE, such as "president Bush 〈Verb〉", "cities such as ProperNoun(Head(〈NounPhrase〉))", and "〈NounPhrase〉 is the CEO of 〈NounPhrase〉". Queries that include typed variables and string-processing functions allow NLP tasks to be done efficiently without downloading the original document during query processing.

BE's novel neighborhood index enables it to process these queries with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield orders of magnitude speedup, as shown in the asymptotic analysis later in this section. The neighborhood index is an augmented inverted index structure. For each term in the corpus, the index keeps a list of documents in which the term appears and a list of positions where the term occurs, just as in a standard inverted index (Baeza-Yates and Ribeiro-Neto, 1999). In addition, the neighborhood index keeps a list of left-hand and right-hand neighbors at each position. These are adjacent text strings that satisfy a recognizer for one of the target types, such as NounPhrase. As with a standard inverted index, a term's list is processed from start to finish, and can be kept on disk as a contiguous piece.
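As a concrete illustration of this structure, here is a toy, in-memory neighborhood index in Python. The tokenizer and the capitalized-run 〈NounPhrase〉 recognizer are simplifying assumptions of this sketch; the real BE runs a proper chunker at build time and keeps each term's augmented postings list contiguous on disk.

```python
import re
from collections import defaultdict

def noun_phrase_at(tokens, i):
    """Toy NounPhrase recognizer: a run of capitalized tokens starting at i.
    BE runs the real recognizer once, while building the index, so no
    linguistic processing is needed at query time."""
    np = []
    while i < len(tokens) and tokens[i][:1].isupper():
        np.append(tokens[i])
        i += 1
    return " ".join(np) or None

def build_neighborhood_index(docs):
    """index[term] -> postings list of (doc_id, position, right-hand NounPhrase).
    A faithful version would also store left-hand neighbors, as BE does."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
        for pos, tok in enumerate(tokens):
            right = noun_phrase_at(tokens, pos + 1)
            index[tok.lower()].append((doc_id, pos, right))
    return index
```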
The relevant string for a variable binding is included directly in the index, so there is no need to fetch the source document (which would require a disk seek). Expensive processing such as part-of-speech tagging or shallow syntactic parsing is performed only once, while building the index, and is not needed at query time. It is important to note that simply preprocessing the corpus and placing the results in a database would not avoid disk seeks, as we would still have to explicitly fetch these results.

Table: Query time and index space of the neighborhood index.
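Continuing the sketch above, query processing then needs nothing but the postings lists: one list per non-variable term (k lists in total), with the binding string read straight out of the index. The fixed "phrase followed by 〈NounPhrase〉" query form here is a simplification of BE's richer query language.

```python
def bindings(index, phrase):
    """Answer the query 'PHRASE <NounPhrase>' from the index alone.
    The binding is the right-hand neighbor stored with the final term;
    no source document is ever fetched."""
    terms = phrase.lower().split()
    # occurrence sets for the leading terms: term -> {(doc_id, position)}
    occ = {t: {(d, p) for d, p, _ in index[t]} for t in terms[:-1]}
    results = []
    for doc_id, pos, right in index[terms[-1]]:
        start = pos - (len(terms) - 1)
        # the leading terms must occupy the immediately preceding positions
        if right and all((doc_id, start + j) in occ[t]
                         for j, t in enumerate(terms[:-1])):
            results.append(right)
    return results

docs = ["We provide tours to cities such as Paris, London, and Berlin.",
        "Cities such as Yakima rarely make headlines."]
idx = build_neighborhood_index(docs)
print(bindings(idx, "cities such as"))  # ['Paris', 'Yakima']
```

Because each per-term list is contiguous on disk in BE, this intersection costs O(k) random seeks plus serial reads, independent of how many source pages would otherwise have to be fetched.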